Scalable Feature Selection for Large Sized Databases

نویسنده

  • Huan Liu
چکیده

Feature selection determines relevant features in the data. It is often applied in pattern classiica-tion. A special constraint for feature selection nowadays is that the size of a database is normally very large. An eeective method is needed to accommodate the practical demands. A scalable probabilistic algorithm is presented here as an alternative to the exhaustive and heuristics approaches. The scalable probabilistic algorithm is designed and implemented to meet the needs arising from real-world data mining applications. Through experiments, we show that (1) the probabilistic algorithm is eeective in obtaining optimal/suboptimal subsets of features; (2) its scalable version expedites feature selection further and can scale up without sacriicing the quality of selected features.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Massively Parallel Unsupervised Feature Selection on Spark

High dimensional data sets pose important challenges such as the curse of dimensionality and increased computational costs. Dimensionality reduction is therefore a crucial step for most data mining applications. Feature selection techniques allow us to achieve said reduction. However, it is nowadays common to deal with huge data sets, and most existing feature selection algorithms are designed ...

متن کامل

Feature selection with test cost constraint

Feature selection is an important preprocessing step in machine learning and data mining. In real-world applications, costs, including money, time and other resources, are required to acquire the features. In some cases, there is a test cost constraint due to limited resources. We shall deliberately select an informative and cheap feature subset for classification. This paper proposes the featu...

متن کامل

Some Issues on Scalable Feature Selection

Feature selection determines relevant features in the data. It is often applied in pattern classiication, data mining, as well as machine learning. A special concern for feature selection nowadays is that the size of a database is normally very large, both vertically and horizontally. In addition, feature sets may grow as the data collection process continues. EEective solutions are needed to a...

متن کامل

An Overview of the New Feature Selection Methods in Finite Mixture of Regression Models

Variable (feature) selection has attracted much attention in contemporary statistical learning and recent scientific research. This is mainly due to the rapid advancement in modern technology that allows scientists to collect data of unprecedented size and complexity. One type of statistical problem in such applications is concerned with modeling an output variable as a function of a sma...

متن کامل

High Dimensional Feature Indexing Using Hybrid Trees

Feature based similarity search is emerging as an important search paradigm in database systems. The technique used is to map the data items as points into a high dimensional feature space which is indexed using a multidimensional data structure. Similarity search then corresponds to a range search over the data structure. Traditional multidimensional data structures (e.g., R-tree, KDB-tree, gr...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 1998